Abstract

We propose a novel criterion for support vector machine learning: 
maximizing the margin in the input space, not in the feature (Hilbert) 
space. This criterion is a discriminative version of the principal curve 
proposed by Hastie et al. The criterion is appropriate in particular 
when the input space is already a well-designed feature space with 
rather small dimensionality. The definition of the margin is generalized
in order to represent prior knowledge. The derived algorithm consists of
two alternating steps to estimate the dual parameters. First, the
parameters are initialized by the original SVM. Then one set of
parameters is updated by a Newton-like procedure, and the other set is
updated by solving a quadratic programming problem. The algorithm
converges in a few steps to a local optimum under mild conditions and
it preserves the sparsity of support vectors. Although the cost of
calculating the temporary variables increases, the complexity of solving
the quadratic programming problem at each step does not change. It
is also shown that the original SVM can be seen as a special case.
We further derive a simplified algorithm which enables us to use
existing code for the original SVM.


1 Introduction 

The support vector machine (SVM) is known as one of the state-of-the-art
methods, especially for pattern recognition. The original SVM maximizes
the margin, which is defined by the minimum distance between samples and
a separating hyperplane in a Hilbert space $\mathcal{H}$. Even when the
dimensionality of $\mathcal{H}$ is very large, it has been proved that the
original SVM has a bound on the generalization error which is independent
of the dimensionality. In practice, however, the original SVM sometimes
gives a very small margin in the input space, because the metric of the
feature space is usually quite different from that of the input space.
Such a situation is undesirable in particular when the input space is
already a well-designed feature space constructed by using some prior
knowledge.

This paper gives a learning algorithm to maximize the margin in the input
space. One difficulty is getting an explicit form of the margin in the
input space, because the classification boundary is curved and the
perpendicular projection from a sample point to the boundary is not always
unique. We solve this problem by linear approximation techniques. The
derived algorithm basically consists of iterations of two alternating
stages: one estimates the projection points, and the other solves a
quadratic programming problem to find the optimal parameter values.

Such a dual structure appears in other frameworks, such as the EM
algorithm and variational Bayes. A more closely related work is the
principal curve proposed by Hastie et al. The principal curve finds a
curve in the 'center' of the points in the input space.

The derived algorithm is not of the gradient-descent type but Newton-like;
hence we have to investigate its convergence property. It is shown that the
derived algorithm does not always converge to the global optimum, but
it converges to a local optimum under mild conditions. Some interesting
relations to the original SVM are also shown: the original SVM can be seen
as a special case of the algorithm, and the number of support vectors does
not increase much over that of the original SVM. The algorithm is verified
through simple simulations.

2 Generalized margin in the input space 

We consider a binary classification problem. The purpose of learning is to
construct a map from an $m$-dimensional input $x \in \mathbb{R}^m$ to a
corresponding output $y \in \{\pm 1\}$ by using a finite number of samples
$(x_1, y_1), \ldots, (x_n, y_n)$.

Let us consider a linear classifier, $y = \mathrm{sgn}[f(x)]$, where
$f(x) = w \cdot \phi(x) + f_0$; $\phi(x)$ is a feature of an input $x$ in a
Hilbert space $\mathcal{H}$, $w \in \mathcal{H}$ is a weight parameter and
$f_0 \in \mathbb{R}$ is a bias parameter. The parameters $w$ and $f_0$ define
a separating hyperplane in the feature space. As a feature function
$\phi(x)$, we only consider a differentiable nonlinear map.

A margin in the input space is defined by the minimum distance from 
sample points to the classification boundary in the input space. Since the 
classification boundary forms a complex curved surface, the distance cannot 
be obtained in an explicit form, and more significantly, a projection from a 
point to the boundary is not unique. 

Here, the metric in the input space need not be Euclidean. Some
Riemannian metric $G(x)$ may be defined, which enables us to represent many
kinds of prior knowledge. For example, the invariance of patterns can
be implemented in this form. Another example is the Fisher information
matrix, which is a natural metric when the input space is a parameter space
of some probability distribution [2]. Although the distance is theoretically
preferable to be measured by the length of a geodesic in the Riemannian
space, this causes computational difficulty. In our formulation, since we
only need a distance from a sample point to another point, we use a
computationally feasible (nonsymmetric) distance from a sample point $x_i$
to another point $x$ in the quadratic norm,

$$ \|x - x_i\|^2_{G_i} = (x - x_i)^\top G_i (x - x_i), $$

where $G_i = G(x_i)$.
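
The quadratic norm above can be sketched in a few lines of plain Python. This is only an illustration: the function name `quadratic_distance_sq` and the example metric are assumptions, not part of the original formulation.

```python
def quadratic_distance_sq(x, x_i, G_i):
    """Squared distance (x - x_i)^T G_i (x - x_i), using the metric at x_i."""
    diff = [a - b for a, b in zip(x, x_i)]
    # Matrix-vector product G_i @ diff, written out with plain lists.
    G_diff = [sum(g * d for g, d in zip(row, diff)) for row in G_i]
    return sum(d * gd for d, gd in zip(diff, G_diff))


# Hypothetical metric that weights the first coordinate four times as much.
G = [[4.0, 0.0], [0.0, 1.0]]
dist_sq = quadratic_distance_sq([1.0, 2.0], [0.0, 0.0], G)
```

The distance is nonsymmetric because the metric is always evaluated at the sample point $x_i$, never at the other point.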

For simplicity, we mainly consider the hard margin case in which sample 
points are separable by a hyperplane in the Hilbert space. The soft margin 
case is discussed in section 5.

Let $x_i^*$ be the closest point on the boundary surface from a sample point
$x_i$, and $d_i = x_i^* - x_i$. Since $d_i$ is invariant under a scalar
transformation of $(w, f_0)$, we can assume that all points are separated
while satisfying

$$ \|d_i\|^2_{G_i} \ge \frac{1}{w \cdot w}, \quad i = 1, \ldots, n. \qquad (1) $$

If we assume at least one of them is an equality, the margin is given by
$1/\sqrt{w \cdot w}$. Then we can find the optimal parameter by minimizing a
quadratic objective function $w \cdot w$ with the constraints (1) and
$y_i f(x_i) > 0$.

In order to solve the optimization problem, we start from a solution of the
original SVM and update the solution iteratively. By two kinds of
linearization techniques and a kernel trick, which are described in the next
section, we obtain a discriminant function at the $k$-th iteration step in
the form of

$$ f^{(k)}(x) = \sum_{i \in \mathrm{S.V.}} \{ a_i^{(k)} k(\hat{x}_i^{(k)}, x) + b_i^{(k)\top} k_x(\hat{x}_i^{(k)}, x) \} + f_0^{(k)}, \qquad (2) $$

where S.V. is a set of indices of support vectors, $k(x, y)$ is a kernel
function and $k_x(x, y)$ is its derivative defined by
$k_x(x, y) = \partial k(x, y)/\partial x$. We have two groups of parameters
here: one consists of $a_i$, $b_i$ and $f_0$, which are linear coefficients,
and the other consists of $\hat{x}_i$, which is an estimate of the projection
point $x_i^*$ and forms the base functions. $a_i$ and $f_0$ are initialized by
the corresponding parameters in the original SVM, and the other parameters
are initialized by $b_i = 0$, $\hat{x}_i = x_i$.


3 Iterative QP by linear approximations 

In this section, we overview the derivation of the update rules of those
parameters. The resulting algorithm is summarized in sec. 3.6.


3.1 Linear approximation of the distance to the boundary

Suppose an estimated projection point $\hat{x}_i$ is given; then we can get
an approximate distance $\|d_i\|_{G_i}$ by a linear approximation. Taking the
Taylor expansion of $f(x_i^*) = 0$ around $\hat{x}_i$ up to the first order,
we obtain a constraint on $d_i$,

$$ f(\hat{x}_i) + \nabla f(\hat{x}_i)^\top (d_i - \hat{d}_i) = 0, $$

where $\hat{d}_i = \hat{x}_i - x_i$. Minimizing $\|d_i\|^2_{G_i}$ under this
constraint, we have

$$ \|d_i\|^2_{G_i} = \frac{( w \cdot \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} + f_0 )^2}{\|w \cdot \psi(\hat{x}_i)\|^2_{G_i^{-1}}}, \qquad (3) $$

where $\psi(x) = \nabla\phi(x) \in \mathcal{H}^m$. Note that this approximate
value is unique, and it is invariant under a scalar transformation of
$(w, f_0)$. Moreover, the approximation is strictly correct when
$\hat{x}_i = x_i^*$ and $\nabla f(x_i^*) \neq 0$.
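
As a rough illustration of (3), the sketch below evaluates the linearized squared distance from the value and gradient of $f$ at the estimated projection point. The numerator of (3) equals $(f(\hat{x}_i) - \nabla f(\hat{x}_i)^\top \hat{d}_i)^2$ and the denominator is $\nabla f(\hat{x}_i)^\top G_i^{-1} \nabla f(\hat{x}_i)$. The function name and the linear test boundary are assumptions chosen for the example.

```python
def linearized_distance_sq(f_hat, grad_hat, d_hat, G_inv):
    """Approximate squared distance from x_i to the boundary f = 0.

    f_hat    : f evaluated at the estimated projection point x_hat
    grad_hat : gradient of f at x_hat (list of length m)
    d_hat    : x_hat - x_i
    G_inv    : inverse of the metric G_i (m x m nested list)
    """
    # Numerator: (f(x_hat) - grad^T d_hat)^2
    num = (f_hat - sum(g * d for g, d in zip(grad_hat, d_hat))) ** 2
    # Denominator: grad^T G_i^{-1} grad
    Ginv_grad = [sum(a * g for a, g in zip(row, grad_hat)) for row in G_inv]
    den = sum(g * h for g, h in zip(grad_hat, Ginv_grad))
    return num / den


# Hypothetical linear boundary f(x) = x1 + x2 - 1 and sample point x_i = (0, 0).
x_hat = [0.3, 0.1]
f_val = x_hat[0] + x_hat[1] - 1.0
identity = [[1.0, 0.0], [0.0, 1.0]]
approx_sq = linearized_distance_sq(f_val, [1.0, 1.0], x_hat, identity)
```

For a linear boundary the first-order expansion is exact, so the result does not depend on the choice of `x_hat`; for a curved boundary it is exact only at the true projection point.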
3.2 Linearization of the constraint

Using the approximate value of the distance, we have a nonlinear constraint,

$$ y_i [ w \cdot \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} + f_0 ] \ge \frac{\|w \cdot \psi(\hat{x}_i)\|_{G_i^{-1}}}{\sqrt{w \cdot w}}. \qquad (4) $$

Since the constraint is nonlinear in $w$, we linearize it around an
approximate solution $w = \hat{w}$, which is the solution at the current step.
This linearization not only simplifies the problem, but also enables us to
derive a dual problem. Let $g_i(w)$ be the right hand side of (4); the first
order expansion is

$$ g_i(w) \simeq g_i(\hat{w}) + \frac{\partial g_i(\hat{w})}{\partial w} \cdot (w - \hat{w}). $$

Now let $\hat{g}_i = g_i(\hat{w})$ and
$\hat{\eta}_i = \partial g_i(\hat{w})/\partial w$; then we have a linear
constraint for $w$,

$$ w \cdot [ y_i \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} - \hat{\eta}_i ] \ge \hat{g}_i - f_0 y_i, \qquad (5) $$

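where we used the fact $\hat{w} \cdot \hat{\eta}_i = 0$. Suppose
$\hat{q}_i = \hat{w} \cdot \psi(\hat{x}_i)$ and $\hat{r} = \hat{w} \cdot \hat{w}$,
then $\hat{g}_i$ and $\hat{\eta}_i$ are given by

$$ \hat{g}_i = \frac{\|\hat{q}_i\|_{G_i^{-1}}}{\sqrt{\hat{r}}}, \qquad \hat{\eta}_i = \frac{\psi(\hat{x}_i)^\top G_i^{-1} \hat{q}_i}{\hat{g}_i \hat{r}} - \frac{\hat{g}_i}{\hat{r}} \hat{w}. \qquad (6) $$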
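
The orthogonality of the current solution and the gradient of the right hand side of (4) follows because that right hand side does not change when $w$ is rescaled. The sketch below checks this numerically in a small finite-dimensional feature space. It assumes $G_i = I$ and uses the closed form obtained by differentiating $\|w \cdot \psi\| / \sqrt{w \cdot w}$ directly; the function name and the numbers are illustrative only.

```python
import math


def g_and_eta(w, psi):
    """Return g(w) = ||q|| / sqrt(w.w) and its gradient eta, with G_i = I.

    w   : weight vector (length D)
    psi : list of m vectors psi[a] (each of length D), the derivative of
          the feature map at the estimated projection point
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    q = [dot(w, p) for p in psi]               # q_a = w . psi_a
    r = dot(w, w)                              # r = w . w
    g = math.sqrt(dot(q, q)) / math.sqrt(r)    # g = ||q|| / sqrt(r)
    eta = [
        sum(q[a] * psi[a][k] for a in range(len(psi))) / (g * r) - g * w[k] / r
        for k in range(len(w))
    ]
    return g, eta


# Hypothetical current solution and feature derivatives (D = 3, m = 2).
w_hat = [1.0, -2.0, 0.5]
psi_hat = [[0.3, 1.0, -1.0], [2.0, 0.0, 1.0]]
g_hat, eta_hat = g_and_eta(w_hat, psi_hat)
orthogonality = sum(a * b for a, b in zip(w_hat, eta_hat))  # should be ~0
```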



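By the above linearization, we can derive the dual problem in a similar way
to the original SVM,

$$ W(\alpha) = \sum_i \hat{g}_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j [ y_i \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} - \hat{\eta}_i ] \cdot [ y_j \{\phi(\hat{x}_j) - \psi(\hat{x}_j)^\top \hat{d}_j\} - \hat{\eta}_j ], $$

which is maximized under the constraints $\alpha_i \ge 0$ and
$\sum_i \alpha_i y_i = 0$. The solution $w$ is given by

$$ w = \sum_i \alpha_i [ y_i \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} - \hat{\eta}_i ]. \qquad (7) $$

Here we can see an apparent relation to the original SVM: by letting
$\hat{x}_i = x_i$, $\hat{\eta}_i = 0$, and $\hat{g}_i = 1$, we have exactly
the same optimization problem as the original SVM.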
3.3 Kernel trick


In order to avoid the calculation of the mapping into a high dimensional
Hilbert space, the SVM applies a kernel trick, by which an inner product is
replaced by a symmetric positive definite kernel function (Mercer kernel)
that is easy to calculate [7], [12]. In our formulation,
$\phi(x) \cdot \phi(y)$ is replaced by a Mercer kernel $k(x, y)$. We also
have to calculate the inner products related to $\psi$ (the derivative of
$\phi$). Let us assume that the kernel function $k$ is differentiable.
Then $\psi(x) \cdot \phi(y)$ is replaced by a vector
$k_x(x, y) = \partial k(x, y)/\partial x$, and $\psi(x) \cdot \psi(y)^\top$ is
replaced by a matrix $K_{xy}(x, y) = \partial^2 k(x, y)/\partial x \partial y^\top$.
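
For a concrete kernel, the two derivative quantities can be written in closed form. The sketch below does this for the spherical Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, which is the kernel used later in the simulation. The function names are assumptions introduced for illustration.

```python
import math


def gauss_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def gauss_kernel_dx(x, y, sigma=1.0):
    """k_x(x, y) = dk/dx = -(x - y) / sigma^2 * k(x, y), a vector of length m."""
    k = gauss_kernel(x, y, sigma)
    return [-(a - b) / sigma ** 2 * k for a, b in zip(x, y)]


def gauss_kernel_dxdy(x, y, sigma=1.0):
    """K_xy(x, y) = d^2 k / dx dy^T, an m x m matrix."""
    k = gauss_kernel(x, y, sigma)
    diff = [a - b for a, b in zip(x, y)]
    m = len(diff)
    return [
        [k * ((1.0 if a == b else 0.0) / sigma ** 2
              - diff[a] * diff[b] / sigma ** 4)
         for b in range(m)]
        for a in range(m)
    ]


# Example evaluation at two arbitrary points.
k_val = gauss_kernel([0.0, 0.0], [1.0, 0.0])
k_dx = gauss_kernel_dx([0.0, 0.0], [1.0, 0.0])
K_dxdy = gauss_kernel_dxdy([0.0, 0.0], [1.0, 0.0])
```

Any other differentiable Mercer kernel can be substituted, as long as its first derivative and mixed second derivative are available.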

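Now we can derive the kernel version of the optimization problem. In (7),
$\hat{\eta}_i \in \mathcal{H}$ has bases related to $\psi(\hat{x}_i)$ and
$\hat{w}$, and the solution $w$ has bases $\phi(\hat{x}_i)$ additionally.
Although $\hat{w}$ can have any kind of bases, we restrict it to the
following form to avoid increasing the number of bases,

$$ \hat{w} = \sum_i \{ \hat{a}_i \phi(\hat{x}_i) + \hat{b}_i^\top \psi(\hat{x}_i) \}. $$

Then we have
$\hat{q}_i = \sum_j \{ \hat{a}_j k_x(\hat{x}_i, \hat{x}_j) + K_{xy}(\hat{x}_i, \hat{x}_j) \hat{b}_j \}$.
Now let

$$ \hat{p}_i = \hat{w} \cdot \phi(\hat{x}_i) = \sum_j \{ \hat{a}_j k(\hat{x}_j, \hat{x}_i) + \hat{b}_j^\top k_x(\hat{x}_j, \hat{x}_i) \}, $$

then $\hat{r}$ is given by
$\hat{r} = \sum_i (\hat{a}_i \hat{p}_i + \hat{b}_i^\top \hat{q}_i)$, and
$\hat{g}_i$ by (6). Further, let us define additional temporary variables
that represent several terms in the objective function,

$$ s_{ij} = \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} \cdot \{\phi(\hat{x}_j) - \psi(\hat{x}_j)^\top \hat{d}_j\} $$
$$ \phantom{s_{ij}} = k(\hat{x}_i, \hat{x}_j) - \hat{d}_i^\top k_x(\hat{x}_i, \hat{x}_j) - \hat{d}_j^\top k_x(\hat{x}_j, \hat{x}_i) + \hat{d}_i^\top K_{xy}(\hat{x}_i, \hat{x}_j) \hat{d}_j, $$

$$ t_{ij} = \hat{\eta}_i \cdot \{\phi(\hat{x}_j) - \psi(\hat{x}_j)^\top \hat{d}_j\} $$
$$ \phantom{t_{ij}} = \frac{1}{\hat{g}_i \hat{r}} \hat{q}_i^\top G_i^{-1} \{ k_x(\hat{x}_i, \hat{x}_j) - K_{xy}(\hat{x}_i, \hat{x}_j) \hat{d}_j \} - \frac{\hat{g}_i}{\hat{r}} (\hat{p}_j - \hat{d}_j^\top \hat{q}_j), $$

$$ u_{ij} = \hat{\eta}_i \cdot \hat{\eta}_j = \frac{1}{\hat{g}_i \hat{g}_j \hat{r}^2} \hat{q}_i^\top G_i^{-1} K_{xy}(\hat{x}_i, \hat{x}_j) G_j^{-1} \hat{q}_j - \frac{\hat{g}_i \hat{g}_j}{\hat{r}}, $$

then we have the objective function in a kernel form,

$$ W(\alpha) = \sum_i \hat{g}_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j ( y_i y_j s_{ij} - y_j t_{ij} - y_i t_{ji} + u_{ij} ), \qquad (8) $$

which is maximized under the constraints

$$ \alpha_i \ge 0, \quad \sum_i y_i \alpha_i = 0. \qquad (9) $$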

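The new parameters can be determined from (7) by

$$ a_i^{(k+1)} = \alpha_i y_i + \beta \hat{a}_i, $$
$$ b_i^{(k+1)} = -\alpha_i \Big( y_i \hat{d}_i + \frac{G_i^{-1} \hat{q}_i}{\hat{g}_i \hat{r}} \Big) + \beta \hat{b}_i, \qquad (10) $$

where $\beta = \sum_j \alpha_j \|\hat{q}_j\|_{G_j^{-1}} / \hat{r}^{3/2}$.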

As for the bias term $f_0$, since the constraint (5) should be satisfied
with equality for $J = \{ i \mid \alpha_i \neq 0 \}$ from the Kuhn-Tucker
condition, we have for any $i \in J$,

$$ f_0 = y_i \hat{g}_i - \sum_j \alpha_j ( y_j s_{ji} - t_{ji} - y_i y_j t_{ij} + y_i u_{ij} ). \qquad (11) $$

From (10), we can estimate the number of support vectors. Let $J_k$ be the
indices of nonzero $\alpha_i$'s at the $k$-th step; then the number of support
vectors is bounded from above by $|J_0 \cup J_1 \cup \cdots \cup J_k|$. Since
$J_k$ does not change much as long as the structure of the classification
boundary is similar, the number of support vectors is expected to be not
much larger than that of the original SVM.
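
The bound on the number of support vectors is just the size of a union of index sets, which the following small sketch makes explicit. The function name and the example index sets are hypothetical.

```python
def support_vector_bound(active_sets):
    """Upper bound |J_0 U J_1 U ... U J_k| on the number of support vectors.

    active_sets : iterable of index collections, one per QP step, each
                  holding the indices i with nonzero alpha_i at that step
    """
    union = set()
    for J in active_sets:
        union |= set(J)
    return len(union)


# Hypothetical active index sets from three successive QP solutions.
bound = support_vector_bound([{0, 3, 7}, {0, 3, 8}, {3, 7, 8}])
```

When the active sets overlap heavily, as in this example, the bound stays close to the size of a single set.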


3.4 Update of the approximate projection of the points 


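To complete the algorithm, we have to consider the update of the
approximate value of the projection point $\hat{x}_i$, which is initialized
by $x_i$; otherwise the solution obtained at convergence is not precisely
what we want. If good approximations $\hat{w}$ and $\hat{f}_0$ of the
solution are given, we can refine $\hat{x}_i$ iteratively in the same way
as in sec. 3.1. Suppose
$\hat{w} = \sum_j \{ \hat{a}_j \phi(\hat{x}_j^{\mathrm{old}}) + \hat{b}_j^\top \psi(\hat{x}_j^{\mathrm{old}}) \}$;
the projection point $\hat{x}_i$ can be estimated by iterating the following
step for $l = 0, 1, 2, 3, \ldots$,

$$ \hat{x}_i^{[l+1]} = x_i - \frac{\hat{p}_i^{[l]} - (\hat{x}_i^{[l]} - x_i)^\top \hat{q}_i^{[l]} + \hat{f}_0}{\|\hat{q}_i^{[l]}\|^2_{G_i^{-1}}} G_i^{-1} \hat{q}_i^{[l]}, \qquad (12) $$

where $\hat{x}_i^{[0]}$ is initialized by $\hat{x}_i^{\mathrm{old}}$, and
$\hat{p}_i^{[l]}$ and $\hat{q}_i^{[l]}$ are defined in a similar way as
$\hat{p}_i$ and $\hat{q}_i$:

$$ \hat{p}_i^{[l]} = \hat{w} \cdot \phi(\hat{x}_i^{[l]}), \qquad \hat{q}_i^{[l]} = \hat{w} \cdot \psi(\hat{x}_i^{[l]}). $$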
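
The update (12) can be tried on a toy boundary. In the sketch below the discriminant function and its gradient are passed in as callables, so that $\hat{p}_i^{[l]} + \hat{f}_0$ is `f(x_hat)` and $\hat{q}_i^{[l]}$ is `grad_f(x_hat)`. The Euclidean metric ($G_i = I$), the function name, and the unit-circle boundary are assumptions made for the example.

```python
def project_to_boundary(x_i, f, grad_f, n_iter=10):
    """Iterate update (12) with G_i = I, starting from x_hat = x_i."""
    x_hat = list(x_i)
    for _ in range(n_iter):
        g = grad_f(x_hat)
        d_hat = [a - b for a, b in zip(x_hat, x_i)]
        # f(x_hat) - d_hat^T grad f(x_hat): the numerator of (12)
        num = f(x_hat) - sum(d * gi for d, gi in zip(d_hat, g))
        den = sum(gi * gi for gi in g)
        # The new estimate is always measured from the sample point x_i.
        x_hat = [xi - num / den * gi for xi, gi in zip(x_i, g)]
    return x_hat


# Hypothetical boundary: the unit circle f(x) = x1^2 + x2^2 - 1 = 0.
def f(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]


x_proj = project_to_boundary([0.3, 0.4], f, grad_f)
```

For the sample point (0.3, 0.4) inside the circle, the iterates move along the radial direction towards the nearest boundary point (0.6, 0.8).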
Note that local maximum points and saddle points of the distance are
also equilibrium states of (12). The following proposition guarantees that
such points are not stable.

Proposition 1 A point $\hat{x}_i \in \mathbb{R}^m$ is an equilibrium state of
the iteration step (12) when and only when the point is a critical point of
the distance from $x_i$ to the separating boundary, i.e., a local minimum, a
local maximum or a saddle point. The equilibrium state is not stable when
the point is a local maximum or a saddle point.

Proof: It is straightforward to show that a point is an equilibrium state of
the iteration step (12) only when the point is a critical point of the
distance $\|d_i\|_{G_i}$. Without loss of generality, we can assume the
uniform metric case $G_i = I$, because the update rule (12) is invariant
under a metric transformation. We consider the behavior around a critical
point $x_i^*$. Let $\hat{x}_i^{[l]} = x_i^* + \epsilon$ for a sufficiently
small vector $\epsilon$. One can show that $\hat{x}_i^{[l]}$ is mapped onto
the separating hypersurface $f(x) = \hat{w} \cdot \phi(x) + \hat{f}_0 = 0$ for
a small $\epsilon$ after one iteration step. Therefore, we only consider the
case in which $\hat{x}_i^{[l]}$ is on the hypersurface.

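Since $x_i^*$ is a critical point of the distance, the gradient
$\nabla f(x_i^*)$, which is normal to the boundary, is collinear with the
distance vector $d_i = x_i^* - x_i$, i.e., for some constant $\lambda$, it
holds that

$$ \nabla f(x_i^*) = \lambda d_i. \qquad (13) $$

Furthermore, if $\hat{x}_i^{[l]}$ is a point on $f(x) = 0$, $\nabla f(x_i^*)$
is nearly orthogonal to $\epsilon$, i.e.,

$$ \nabla f(x_i^*)^\top \epsilon \simeq 0. \qquad (14) $$

By expanding (12) around $x_i^*$, we have a new estimate $\hat{x}_i^{[l+1]}$
given by

$$ \hat{x}_i^{[l+1]} \simeq x_i^* + \frac{1}{\lambda} \nabla^2 f(x_i^*) \epsilon - \frac{d_i^\top \nabla^2 f(x_i^*) \epsilon}{\lambda \|d_i\|^2} d_i, \qquad (15) $$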

where $\nabla^2 f$ is the Hessian matrix of $f(x)$. Without loss of
generality, we can take the coordinates of $x$ as follows: the first
coordinate is the direction of $d_i$, and the second to the $m$-th
coordinates are taken orthogonally such that the $(m-1) \times (m-1)$
submatrix of $\nabla^2 f(x_i^*)$ for those coordinates is diagonalized,
i.e., $\nabla^2 f(x_i^*)$ is in the form

$$ \nabla^2 f(x_i^*) = \begin{pmatrix} c_1 & b^\top \\ b & \mathrm{diag}(c_2, \ldots, c_m) \end{pmatrix}. \qquad (16) $$

Under this coordinate system, since $\epsilon_1$ is negligible to first
order by (13) and (14), the first element calculated from the second and
third terms in (15) vanishes and we have

$$ \hat{x}_i^{[l+1]} - x_i^* \simeq \frac{1}{\lambda} (0, c_2 \epsilon_2, \ldots, c_m \epsilon_m)^\top. \qquad (17) $$

The iteration step is stable at $x_i^*$ only when
$\|\hat{x}_i^{[l+1]} - x_i^*\| < \|\epsilon\|$, i.e.,
$|c_j| < |\lambda|$ for all $j = 2, \ldots, m$. □
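
The stability condition can be checked numerically in the two-dimensional case, where there is a single tangential direction and hence a single curvature term $c_2$. The sketch below is an illustration under that assumption; the function name and the unit-circle example are not from the original text.

```python
def is_stable_2d(grad, hess, d):
    """Check |c_2| < |lambda| at a critical point x* in two dimensions.

    grad : gradient of f at x* (collinear with d, as in (13))
    hess : 2 x 2 Hessian of f at x*
    d    : distance vector x* - x_i
    """
    d_sq = d[0] ** 2 + d[1] ** 2
    lam = (grad[0] * d[0] + grad[1] * d[1]) / d_sq   # grad = lam * d
    norm = d_sq ** 0.5
    t = (-d[1] / norm, d[0] / norm)                  # unit vector orthogonal to d
    # c_2 = t^T H t, the Hessian entry along the tangential axis.
    c2 = sum(t[a] * hess[a][b] * t[b] for a in range(2) for b in range(2))
    return abs(c2) < abs(lam)


# Unit circle f(x) = x1^2 + x2^2 - 1 with the sample point x_i = (0.5, 0).
hessian = [[2.0, 0.0], [0.0, 2.0]]
nearest_stable = is_stable_2d([2.0, 0.0], hessian, [0.5, 0.0])      # x* = (1, 0)
farthest_stable = is_stable_2d([-2.0, 0.0], hessian, [-1.5, 0.0])   # x* = (-1, 0)
```

In this example the nearest point on the circle is a stable equilibrium, while the farthest point, a local maximum of the distance, is not.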

The condition in the $\lambda$-$c_j$ plane is shown in figure 1.





When the point is a local maximum or a saddle point, the hypersurface is in
the unstable region. However, even in the case of a local minimum, there
exists an unstable region when the hypersurface is strongly curved. We can
avoid this undesired behavior by slowing down the update. For example,
first $c_2, \ldots, c_m$ and $\lambda$ are estimated from the values of
$\nabla f$ and $\nabla^2 f$ at the current estimate; then, if
$c_j < |\lambda|$ for all $j = 2, \ldots, m$, the point is taken to be a
local minimum, and the movement $\hat{x}_i^{[l+1]} - \hat{x}_i^{[l]}$ along
the axes in which $c_j < -|\lambda|$ is shrunk by multiplying by some factor
$0 < e_j < |\lambda| / |c_j|$.

This computationally intensive treatment would usually be necessary only
after several steps, because the instability for local minima is considered
to occur only in a region that is small relative to the size of $d_i$.


3.5 Projection of the hyperplane 


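The update of $\hat{x}_i$ causes another problem: we assumed above that
$w$ and $\hat{w}$ have the same bases. However, $\hat{w}$ has bases based on
the old $\hat{x}_i$, while we need the new $w$ based on the new $\hat{x}_i$.
To solve that problem, $\hat{w}$ is projected onto the new bases, i.e., from
the old one
$\hat{w}^{\mathrm{old}} = \sum_{i \in \mathrm{S.V.}} \{ \hat{a}_i^{\mathrm{old}} \phi(\hat{x}_i^{\mathrm{old}}) + \hat{b}_i^{\mathrm{old}\top} \psi(\hat{x}_i^{\mathrm{old}}) \}$
to a new one,
$\hat{w}^{\mathrm{new}} = \sum_{i \in \mathrm{S.V.}} \{ \hat{a}_i^{\mathrm{new}} \phi(\hat{x}_i^{\mathrm{new}}) + \hat{b}_i^{\mathrm{new}\top} \psi(\hat{x}_i^{\mathrm{new}}) \}$.
Although $\hat{w}^{\mathrm{new}}$ can have more bases than those of the
support vectors, we restrict the bases to the support vectors to preserve
the sparsity of the bases.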

There are several possibilities for the projection. In this paper, we use
the one which minimizes the cost function

$$ \frac{1}{2} \sum_{x \in T} \{ \hat{w}^{\mathrm{new}} \cdot \phi(x) + \hat{f}_0^{\mathrm{new}} - ( \hat{w}^{\mathrm{old}} \cdot \phi(x) + \hat{f}_0^{\mathrm{old}} ) \}^2, \qquad (18) $$

where $T$ is a certain set of $x$, and we use
$T = \{ x_i, \hat{x}_i^{\mathrm{old}}, \hat{x}_i^{\mathrm{new}} ;\ i = 1, \ldots, n \}$.

Minimizing (18) leads to a simple least squares problem, which can be
solved by linear equations. Another possibility for the cost function is
$\|\hat{w}^{\mathrm{new}} - \hat{w}^{\mathrm{old}}\|^2$, which leads to
another set of linear equations.


3.6 Overall algorithm and the convergence property 

Now let us summarize the algorithm below. 

Algorithm 1: Algorithm to maximize the margin in the input space

Initialization step: Let the solution of the original SVM be $a_i^{(0)}$ and
$f_0^{(0)}$; let $b_i^{(0)} = 0$ and $\hat{x}_i^{(0)} = x_i$.

For $k = 0, 1, 2, \ldots$, repeat the following steps until convergence:

1. Update of $\hat{x}_i$: Calculate $\hat{x}_i^{(k+1)}$ by applying (12)
iteratively to $\hat{x}_i^{(k)}$.

2. Projection of hyperplane: Calculate $\hat{a}_i$, $\hat{b}_i$ and
$\hat{f}_0$ based on $\hat{x}_i^{(k+1)}$ by a certain projection method from
$a_i^{(k)}$, $b_i^{(k)}$ and $f_0^{(k)}$ based on $\hat{x}_i^{(k)}$
(sec. 3.5).

3. QP step: Solve the QP problem (8) with respect to $\alpha_i$.

4. Parameter update: Calculate $a_i^{(k+1)}$, $b_i^{(k+1)}$ and
$f_0^{(k+1)}$ by (10) and (11).

The discriminant function at the $k$-th step is given by (2).

Although Algorithm 1 does not always converge to the global minimum,
we can prove the following proposition concerning the convergence of
the algorithm.

Proposition 2 Equilibrium points of Algorithm 1 are critical points of the
margin in the input space. The algorithm is stable when the update rule (12)
of $\hat{x}_i$ is stable for all $i$ (see also Proposition 1).

This proposition can be proved basically by Proposition 1 and the fact that
the linearization of the QP is almost exact under a small perturbation of
$w$. As in the case of (12), we can modify the algorithm by slowing down the
updates so that the equilibrium state is stable when and only when the
margin is locally optimal. However, we don't use this modification in the
simulation because the case in which a local minimum is unstable is
expected to be rare.

Another problem of Algorithm 1 is that each iteration step does not
always increase the margin monotonically. Although it is usually faster than
gradient type algorithms, the algorithm sometimes does not improve the
solution of the original SVM at all. Because the original SVM can be seen
as a special case of the algorithm, we can use some annealing technique,
for example, updating the temporary variables and parameters more gradually
from their initial values. However, for simplicity, we use a crude method
in the simulation as follows: repeat several steps of the algorithm (5 steps
in the simulation) and then choose the best solution, i.e., the one which
gives the largest estimated value of the margin.
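
The crude selection method described above can be sketched as a small driver loop. Here `step` and `margin` are hypothetical callables standing in for one iteration of Algorithm 1 and for the estimate of the input-space margin; they are not defined in the original text.

```python
def best_of_steps(initial, step, margin, n_steps=5):
    """Run `step` n_steps times and keep the iterate with the largest margin.

    initial : starting solution (the original SVM solution in the text)
    step    : callable performing one iteration of the algorithm
    margin  : callable returning the estimated input-space margin
    """
    best, best_margin = initial, margin(initial)
    current = initial
    for _ in range(n_steps):
        current = step(current)
        m = margin(current)
        if m > best_margin:
            best, best_margin = current, m
    return best, best_margin


# Toy stand-ins: the "solution" is a number and the margin peaks at 2.
best, best_margin = best_of_steps(0, lambda s: s + 1, lambda s: -(s - 2) ** 2)
```

Because the initial solution is included among the candidates, the selected margin is never smaller than that of the starting point.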

As for the complexity of the algorithm, we need $O(m^2 n^2)$ space and
$O(m^3 n^2)$ time to calculate the temporary variables if the computation
of a kernel function is $O(m)$, while the original SVM requires $O(n^2)$
space and $O(mn^2)$ time. Those calculations can be parallelized easily.
This complexity is not so different when $m$ is comparatively small. Once
the variables are calculated, the complexity of the QP is just the same.
Therefore, as long as the calculation time for the temporary variables is
comparable to the QP time, the proposed algorithm is comparable to the
original SVM. If Algorithm 1 is heavy because of a large $m$, we can use a
simplified algorithm as shown in section 6.

As for the iteration of the QP, which is usually carried out for only a few
steps, the current solution is an estimate of the next solution, so it may
be possible to use it to reduce the complexity of the QP at the next
iteration step.

4 Simulation results 

In this section, we give a simulation result for artificial data sets in
order to verify the proposed algorithm and to examine its basic performance.
20 training samples and 1000 test samples are randomly drawn from positive
and negative distributions, each of which is a Gaussian mixture of 3
components with centers uniformly distributed in $[0, 1)^2$ and fixed
spherical variance $\sigma^2 = 0.2^2$. The kernel function used here is a
spherical Gaussian kernel with $\sigma^2 = 1^2$. The metric is taken to be
Euclidean (i.e., $G_i$ is the unit matrix). Figures 2 and 3 show an example
of results by the original SVM (initial condition) and the proposed
algorithm (after 5 steps). In this case, the margin value increases from
0.040 to 0.096. Such a simulation is repeated for 100 sets of
samples with different random numbers.
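
A data set of the kind described above could be generated along the following lines. This is only a sketch: the equal mixing weights, the even split of samples between the two classes, the fixed seed, and all function names are assumptions, since the text does not state them.

```python
import random


def random_centers(rng, n_components=3):
    """Component centers drawn uniformly from [0, 1)^2."""
    return [(rng.random(), rng.random()) for _ in range(n_components)]


def draw(centers, n, rng, sigma=0.2):
    """Draw n points from a spherical Gaussian mixture with equal weights."""
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
    return points


rng = random.Random(0)  # the seed is arbitrary
pos_centers, neg_centers = random_centers(rng), random_centers(rng)

# Assumption: training and test samples are split evenly between the classes.
X_train = draw(pos_centers, 10, rng) + draw(neg_centers, 10, rng)
y_train = [+1] * 10 + [-1] * 10
X_test = draw(pos_centers, 500, rng) + draw(neg_centers, 500, rng)
y_test = [+1] * 500 + [-1] * 500
```

Training and test samples share the same mixture centers, so both are drawn from the same pair of class distributions.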

The estimated margins in the input space for the original and proposed
algorithms are shown in figure 4 (log-log scale). With the crude method
described in the previous section, there are 4 cases among 100 runs in which
the margin of the original SVM cannot be improved. The ratios of the margin
are distributed from 1.00 (no improvement) to 27.9.

The misclassification errors for test samples are shown in figure 5. The
ratios of the error are distributed in [0.40 (best), 1.37 (worst)].

These results indicate that the margin in the input space is effective in
improving the generalization performance on average, but there are cases in
which the generalization error is not reduced even when the margin in the
input space increases.
Figure 2: Result of the original SVM (margin 0.040). Circles (o) and crosses
(x) are positive and negative samples. Squares (□) represent estimates of
the projection of the points obtained by applying (12) for 10 steps.

Figure 3: Result of Algorithm 1 (after 5 steps, margin 0.096) for the same
data set as fig. 2.

5 Soft margin

In noisy situations, the hard margin classifier often overfits the samples.
There are several possibilities for incorporating a soft margin; here we
give a simple one. The soft margin can be derived by introducing slack
variables $z_i$ into the optimization problem. We use a soft constraint in
the form

$$ w \cdot [ y_i \{\phi(\hat{x}_i) - \psi(\hat{x}_i)^\top \hat{d}_i\} - \hat{\eta}_i ] \ge \hat{g}_i - f_0 y_i - z_i, \qquad (19) $$

and add a penalty for the slack variables to the objective function,

$$ \frac{1}{2} w \cdot w + C \sum_i z_i. \qquad (20) $$

By this modification, only the constraint (9) for $\alpha_i$ is changed to

$$ 0 \le \alpha_i \le C, \quad \sum_i y_i \alpha_i = 0, \qquad (21) $$

which is the same constraint as the soft margin of the original SVM. However,
the geometrical meaning of (19) in the input space is not clear. It is future
work to introduce a natural soft constraint in the input space.


6 Simplified algorithm for a high dimensional case

Although Algorithm 1 achieves the precise solution, the computation cost is
high for a large dimensionality of inputs. In this section, we give a
simplified algorithm.

If we don't update $\hat{x}_i$, the first and the second steps of Algorithm 1
are not necessary any more. This simplification makes Algorithm 1 a little
simpler because all $\hat{d}_i$ terms vanish. However, let us consider a
further simplification.

We have shown the relation to the original SVM: the original SVM is obtained
by letting $\hat{g}_i = 1$ and $\hat{\eta}_i = 0$. Since $\hat{\eta}_i$ causes
many temporary variables, we only maintain $\hat{g}_i$. Then all the terms
related to the $b_i$'s vanish.

Consequently, the above simplifications lead to an algorithm much like
the original SVM. In fact, the existing code for the original SVM can be used
as follows: for each step, first $\hat{g}_i$ is calculated,

$$ \hat{g}_i = \frac{\| \sum_j a_j^{(k)} k_x(x_i, x_j) \|_{G_i^{-1}}}{\sqrt{\sum_{j,l} a_j^{(k)} a_l^{(k)} k(x_j, x_l)}}. \qquad (22) $$

Then, by letting the $(i, j)$ element of the kernel matrix be
$k(x_i, x_j)/(\hat{g}_i \hat{g}_j)$, the original SVM for this kernel matrix
gives the solution for each step of the simplified algorithm.
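
One step of the simplified algorithm can be sketched as follows: compute $\hat{g}_i$ from the current coefficients as in (22), then build the rescaled kernel matrix that would be handed to an existing SVM solver. The sketch assumes a Gaussian kernel and $G_i = I$; the function names, the sample points and the coefficients are hypothetical, and the SVM solver itself is not shown.

```python
import math


def gauss_k(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))


def gauss_k_dx(x, y, sigma=1.0):
    k = gauss_k(x, y, sigma)
    return [-(a - b) / sigma ** 2 * k for a, b in zip(x, y)]


def g_hat(X, a, sigma=1.0):
    """g_i = ||sum_j a_j k_x(x_i, x_j)|| / sqrt(sum_{j,l} a_j a_l k(x_j, x_l))."""
    n, m = len(X), len(X[0])
    r = sum(a[j] * a[l] * gauss_k(X[j], X[l], sigma)
            for j in range(n) for l in range(n))
    g = []
    for i in range(n):
        q = [0.0] * m
        for j in range(n):
            dk = gauss_k_dx(X[i], X[j], sigma)
            q = [qa + a[j] * d for qa, d in zip(q, dk)]
        # A zero gradient at x_i would give g_i = 0 and break the rescaling.
        g.append(math.sqrt(sum(v * v for v in q)) / math.sqrt(r))
    return g


def rescaled_kernel_matrix(X, g, sigma=1.0):
    """Kernel matrix with (i, j) element k(x_i, x_j) / (g_i g_j)."""
    n = len(X)
    return [[gauss_k(X[i], X[j], sigma) / (g[i] * g[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical samples and current coefficients a_j.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
a = [1.0, -0.5, -0.5]
g = g_hat(X, a)
K = rescaled_kernel_matrix(X, g)
```

The rescaled matrix stays symmetric, so it can be passed to any SVM implementation that accepts a precomputed kernel matrix.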

7 Conclusion 

We have proposed a new learning algorithm to find a kernel-based classifier
that maximizes the margin in the input space. The derived algorithm consists
of an alternating optimization between the foot of the perpendicular and
the linear coefficient parameters. Such a dual structure appears in other
frameworks, such as the EM algorithm, variational Bayes, and the principal
curve.

There are many issues to be studied about the algorithm, for example,
analyzing the generalization performance theoretically and finding an
efficient algorithm that reduces the complexity and converges more stably.
It is also an interesting issue to extend our framework to problems other
than classification, such as regression.

In this paper, we have assumed that the kernel function is given and
fixed. Recently, several techniques and criteria for choosing a kernel
function have been proposed. We expect that those techniques and other
knowledge about the original SVM can be incorporated into our framework.
Applying the algorithm to real world data is also important.